The hierarchical condition category (HCC) score is a model developed by CMS to assign risk scores for patients based on health conditions. The risk score accounts for factors such as hospital admissions for acute and chronic issues, as well as various classes of diseases and conditions ranging in severity. The national average score is the reference point, and is equal to 1. Patients in poorer health than the average beneficiary would have a score greater than 1, whereas a patient with better health than the average beneficiary would have a score closer to 0 according to the CMS HCC model.
Although it is commonly used by insurance companies to estimate future patient costs, this metric can serve as a proxy to quantify the average individual health of a community. The average HCC score across U.S. counties is roughly normally distributed, with a median value of 0.96. The distribution is slightly right skewed, and a log transform might be considered if the model violates the LINE assumptions.
summary(county_dat$average_hcc_score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6000 0.9000 0.9600 0.9625 1.0200 1.5700
length(county_dat$average_hcc_score)
## [1] 3046
ggplot() +
geom_histogram(aes(county_dat$average_hcc_score), col=1) +
ggtitle('Normality of Outcome') +
xlab('average hcc score')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1
ggplot() +
geom_histogram(aes(log10(county_dat$average_hcc_score)), fill=2, col =1) +
ggtitle('Normality of Outcome (log)') +
xlab('log(average hcc score)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 2
The primary covariate of interest is expansion status, and average age & percent eligible for medicaid are other important factors that should be adjusted for. There seems to be a slight difference in average HCC scores in the states that have not expanded & states that have expanded Medicaid, with larger variance in the counties that have not expanded.
county_dat %>% ggplot() +
geom_violin(aes(expansion_status, average_hcc_score,
fill = expansion_status), alpha = 0.5,
draw_quantiles = c(0.5)) +
#scale_y_reverse() +
ylab('average HCC score') +
xlab('expansion status') +
annotate("text", label = "better health", x = 1, y = 0.6) +
annotate("text", label = "worse health", x = 1, y = 1.5) +
ggtitle('2018 Average HCC Scores per US County')
vars <- c("average_hcc_score",
"expansion_status",
"average_age",
"percent_male",
"percent_eligible_for_medicaid")
covars <- county_dat %>% select(vars) %>%
mutate(expansion_status = as.integer(expansion_status)) %>%drop_na()
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(vars)` instead of `vars` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
corrplot(cor(covars), tl.cex = 0.7)
mlr_red <- lm(average_hcc_score ~ as.factor(expansion_status),
data = county_dat)
summary(mlr_red)
##
## Call:
## lm(formula = average_hcc_score ~ as.factor(expansion_status),
## data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34534 -0.06376 0.00466 0.05722 0.59624
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.942775 0.006765 139.351 < 2e-16 ***
## as.factor(expansion_status)2 0.002566 0.007539 0.340 0.73360
## as.factor(expansion_status)3 0.022435 0.008423 2.664 0.00777 **
## as.factor(expansion_status)4 0.030987 0.007196 4.306 1.71e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09781 on 3042 degrees of freedom
## Multiple R-squared: 0.01824, Adjusted R-squared: 0.01727
## F-statistic: 18.84 on 3 and 3042 DF, p-value: 4.171e-12
mlr <- lm(average_hcc_score ~ expansion_status +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
data = county_dat)
summary(mlr)
##
## Call:
## lm(formula = average_hcc_score ~ expansion_status + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2), data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30341 -0.04604 0.00457 0.04884 0.44259
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.744158 0.009241 80.527 < 2e-16 ***
## expansion_status2 0.009571 0.006050 1.582 0.113791
## expansion_status3 0.025738 0.006755 3.810 0.000141 ***
## expansion_status4 0.049787 0.005794 8.592 < 2e-16 ***
## percent_eligible_for_medicaid 1.116492 0.060283 18.521 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -0.875561 0.113795 -7.694 1.91e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07843 on 3040 degrees of freedom
## Multiple R-squared: 0.3692, Adjusted R-squared: 0.3681
## F-statistic: 355.8 on 5 and 3040 DF, p-value: < 2.2e-16
plot(mlr)
### Model Diagnostics
The log transform does not remedy the normality violation of the model.
mlr_log <- lm(log(average_hcc_score) ~ expansion_status +
# average_age + I(average_age**2) +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
data = county_dat)
summary(mlr_log)
##
## Call:
## lm(formula = log(average_hcc_score) ~ expansion_status + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2), data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34968 -0.04565 0.00795 0.05264 0.33740
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.283270 0.009737 -29.092 < 2e-16 ***
## expansion_status2 0.010625 0.006375 1.667 0.095684 .
## expansion_status3 0.026142 0.007117 3.673 0.000244 ***
## expansion_status4 0.052189 0.006105 8.548 < 2e-16 ***
## percent_eligible_for_medicaid 1.287246 0.063518 20.266 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -1.183687 0.119902 -9.872 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08264 on 3040 degrees of freedom
## Multiple R-squared: 0.3603, Adjusted R-squared: 0.3593
## F-statistic: 342.5 on 5 and 3040 DF, p-value: < 2.2e-16
plot(mlr_log)
Consider average age & gender in model; both terms are significant & improve Adjusted R squared value. However, the linearity assumption seems to be violated based on the residuals vs fitted values plot. Also, the negative coefficients seem to be counterintuitive (HCC score would be expected to increase with increased age, not decrease as suggested by the negative coefficient).
mlr2 <- lm(average_hcc_score ~ expansion_status +
average_age +
percent_male +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
data = county_dat)
summary(mlr2)
##
## Call:
## lm(formula = average_hcc_score ~ expansion_status + average_age +
## percent_male + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26029 -0.04249 0.00016 0.04124 0.39525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3482968 0.0956672 24.547 < 2e-16 ***
## expansion_status2 0.0006056 0.0053668 0.113 0.9102
## expansion_status3 0.0110246 0.0060051 1.836 0.0665 .
## expansion_status4 0.0327772 0.0051644 6.347 2.53e-10 ***
## average_age -0.0099758 0.0011070 -9.012 < 2e-16 ***
## percent_male -1.8028098 0.0623103 -28.933 < 2e-16 ***
## percent_eligible_for_medicaid 0.8197113 0.0603955 13.572 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -0.4482033 0.1047364 -4.279 1.93e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06944 on 3038 degrees of freedom
## Multiple R-squared: 0.5058, Adjusted R-squared: 0.5047
## F-statistic: 444.3 on 7 and 3038 DF, p-value: < 2.2e-16
plot(mlr2)
Figure 3: Diagnostic Plots for model without EM
Figure 3: Diagnostic Plots for model without EM
Figure 3: Diagnostic Plots for model without EM
Figure 3: Diagnostic Plots for model without EM
The model with the additional linear & quadratic % eligible for medicaid terms is needed to explain the variation in the data. The reduced model is not sufficient (F-test p < 0.05).
Additionally, the model with the age & gender terms explain more variation in the data than the model consdering only the linear & quadratic % eligible for medicaid terms.
anova(mlr_red, mlr)
## Analysis of Variance Table
##
## Model 1: average_hcc_score ~ as.factor(expansion_status)
## Model 2: average_hcc_score ~ expansion_status + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3042 29.101
## 2 3040 18.699 2 10.402 845.54 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(mlr, mlr2)
## Analysis of Variance Table
##
## Model 1: average_hcc_score ~ expansion_status + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2)
## Model 2: average_hcc_score ~ expansion_status + average_age + percent_male +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3040 18.699
## 2 3038 14.648 2 4.0512 420.12 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(mlr2)
## [1] -7595.246
AIC(mlr)
## [1] -6855.463
The interaction terms are not statistically significant (p>0.05) when examining whether expansion status modifies the % eligible for medicaid in each county.
mlr_em <- lm(average_hcc_score ~ expansion_status +
average_age +
percent_male +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2) +
expansion_status*percent_eligible_for_medicaid,
data = county_dat)
summary(mlr_em)
##
## Call:
## lm(formula = average_hcc_score ~ expansion_status + average_age +
## percent_male + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2) +
## expansion_status * percent_eligible_for_medicaid, data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.27287 -0.04207 0.00053 0.04186 0.38448
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 2.376045 0.096294 24.675
## expansion_status2 0.008369 0.015373 0.544
## expansion_status3 -0.003638 0.017408 -0.209
## expansion_status4 0.019711 0.014726 1.339
## average_age -0.010376 0.001115 -9.309
## percent_male -1.784617 0.062531 -28.540
## percent_eligible_for_medicaid 0.783964 0.083827 9.352
## I(percent_eligible_for_medicaid^2) -0.446826 0.105314 -4.243
## expansion_status2:percent_eligible_for_medicaid -0.037226 0.065485 -0.568
## expansion_status3:percent_eligible_for_medicaid 0.067180 0.074433 0.903
## expansion_status4:percent_eligible_for_medicaid 0.063294 0.063084 1.003
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## expansion_status2 0.586
## expansion_status3 0.834
## expansion_status4 0.181
## average_age < 2e-16 ***
## percent_male < 2e-16 ***
## percent_eligible_for_medicaid < 2e-16 ***
## I(percent_eligible_for_medicaid^2) 2.27e-05 ***
## expansion_status2:percent_eligible_for_medicaid 0.570
## expansion_status3:percent_eligible_for_medicaid 0.367
## expansion_status4:percent_eligible_for_medicaid 0.316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06936 on 3035 degrees of freedom
## Multiple R-squared: 0.5074, Adjusted R-squared: 0.5058
## F-statistic: 312.6 on 10 and 3035 DF, p-value: < 2.2e-16
The interaction term considering effect modification between average age & percent male is significant, and also results in more intuitive coefficients that align with expected directions for increasing males & older members of the county population.
The linearity of the model is also improved with the addition of the interaction term.
mlr_em_age_gender <- lm(average_hcc_score ~ expansion_status +
average_age + #I(average_age**2) +
percent_male + I(percent_male**2) +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2) +
average_age*percent_male,
data = county_dat)
summary(mlr_em_age_gender)
##
## Call:
## lm(formula = average_hcc_score ~ expansion_status + average_age +
## percent_male + I(percent_male^2) + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2) + average_age * percent_male,
## data = county_dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26087 -0.04295 -0.00021 0.04058 0.39360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.543587 1.283061 -2.762 0.00578 **
## expansion_status2 0.001065 0.005352 0.199 0.84232
## expansion_status3 0.014587 0.006040 2.415 0.01579 *
## expansion_status4 0.034507 0.005182 6.659 3.27e-11 ***
## average_age 0.049072 0.014906 3.292 0.00101 **
## percent_male 14.310687 3.357018 4.263 2.08e-05 ***
## I(percent_male^2) -7.576109 1.849526 -4.096 4.31e-05 ***
## percent_eligible_for_medicaid 0.823500 0.060687 13.570 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -0.452661 0.105487 -4.291 1.83e-05 ***
## average_age:percent_male -0.125855 0.031630 -3.979 7.08e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06919 on 3036 degrees of freedom
## Multiple R-squared: 0.5096, Adjusted R-squared: 0.5081
## F-statistic: 350.5 on 9 and 3036 DF, p-value: < 2.2e-16
plot(mlr_em_age_gender)
age <- 65
perc_male <- .5
# 1 unit change in % male
x <- 0.049072*age + 14.3*perc_male-7.57*perc_male**2 - 0.125855*age
y <- 0.049072*age + 14.3*(perc_male + 0.01)-7.57*((perc_male + 0.01)**2) - 0.125855*age
y - x
## [1] 0.066543
# 1 unit change in age
x <- 0.049072*age + 14.3*perc_male-7.57*perc_male**2 - 0.125855*age
y <- 0.049072*(age+1) + 14.3*perc_male-7.57*(perc_male**2) - 0.125855*(age+1)
y - x
## [1] -0.076783
m <- lm(average_hcc_score ~ percent_male + I(percent_male**2) +
average_age +
percent_male*average_age,
data = county_dat)
plot(county_dat$percent_male, county_dat$average_hcc_score, col = 8,
xlab = "percent male", ylab = "average hcc score")
lines(seq(.3,.7,.05), predict(m, newdata = data.frame(average_age = 65,
percent_male = seq(.3,.7,.05))),
col = 3
)
lines(seq(.3,.7,.05), predict(m, newdata = data.frame(average_age = 71,
percent_male = seq(.3,.7,.05))),
col = 5
)
lines(seq(.3,.7,.05), predict(m, newdata = data.frame(average_age = 72,
percent_male = seq(.3,.7,.05))),
col = 4
)
legend("topleft", c("65 years", "71 years", "72 years"), pch=1, col=c(3,5,4))
AIC(mlr_em_age_gender)
## [1] -7614.531
anova( mlr2, mlr_em_age_gender)
## Analysis of Variance Table
##
## Model 1: average_hcc_score ~ expansion_status + average_age + percent_male +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2)
## Model 2: average_hcc_score ~ expansion_status + average_age + percent_male +
## I(percent_male^2) + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2) +
## average_age * percent_male
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3038 14.648
## 2 3036 14.536 2 0.11155 11.649 9.127e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because the residual plot depicts a violation of homoscedasticity and the qq plot illustrates deviation from normality, robust sandwich estimation methods were used to calculate the standard errors for the coefficients in the final adjusted model. This method is more robust than the standard LSE method for estimating the standard errors, and can be used to derive more accurate confidence intervals.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
coeftest(mlr_em_age_gender, vcov = vcovHC(mlr_em_age_gender, type="HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.5435873 1.6569047 -2.1387 0.032541 *
## expansion_status2 0.0010648 0.0052007 0.2047 0.837782
## expansion_status3 0.0145870 0.0057930 2.5181 0.011852 *
## expansion_status4 0.0345069 0.0051904 6.6481 3.506e-11 ***
## average_age 0.0490723 0.0189325 2.5920 0.009589 **
## percent_male 14.3106865 4.3593577 3.2828 0.001040 **
## I(percent_male^2) -7.5761093 2.3504119 -3.2233 0.001281 **
## percent_eligible_for_medicaid 0.8234996 0.0969384 8.4951 < 2.2e-16 ***
## I(percent_eligible_for_medicaid^2) -0.4526607 0.2091976 -2.1638 0.030558 *
## average_age:percent_male -0.1258554 0.0402533 -3.1266 0.001785 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Confidence Intervals
(0.0010648 + c(-1,1)*1.96*0.0052007) #95% ci for expansion status 2
## [1] -0.009128572 0.011258172
(0.0145870 + c(-1,1)*1.96*0.0057930 ) #95% ci for expansion status 3
## [1] 0.00323272 0.02594128
(0.0345069 + c(-1,1)*1.96* 0.0051904) #95% ci for expansion status 4
## [1] 0.02433372 0.04468008
\[ \begin{split} & E(\text{average HCC score}) = -3.54 \\ & 0.001*\text{I(State Expanded Medicaid in 2014)} + \\ & 0.015*\text{I(State Expanded Medicaid after 2014)} + \\ & 0.035*\text{I(State has not Expanded Medicaid)} + \\ & 0.049*\text{(average age)} + \\ & 14.31*\text{(percent male)} \\ & -7.58*\text{(percent male)}^2 + \\ & 0.82*\text{(percent eligible for medicaid)} \\ & -0.45*\text{(percent eligible for medicaid)}^2 \\ & -0.13*\text{(average age * percent male)} \end{split} \]
Is there a difference in average HCC scores in 2018 vs 2010 for counties that expanded Medicaid early versus late?
Both the hcc ratio of 2018 vs 2010 values and the difference are normally distributed with long tails on both ends. Taking the log transform of the ratio does not substantially remedy the long tail issue.
summary(county_dat_combo$hcc_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7767 0.9881 1.0204 1.0229 1.0581 1.4848
ggplot() + geom_histogram(aes(county_dat_combo$hcc_ratio), col=1) +
xlab('HCC ratio (2018 HCC/2010 HCC)') +
ggtitle('Normality of Outcome')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1a & 1b
summary(county_dat_combo$hcc_ratio)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7767 0.9881 1.0204 1.0229 1.0581 1.4848
ggplot() + geom_histogram(aes(log10(county_dat_combo$hcc_ratio)), col=1) +
xlab('HCC ratio (log)') +
ggtitle('Normality of Outcome (log)')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 1a & 1b
summary(county_dat_combo$hcc_diff)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.2300 -0.0100 0.0200 0.0208 0.0500 0.3200
ggplot() + geom_histogram(aes(county_dat_combo$hcc_diff), col=1) +
xlab('HCC difference (2018 HCC-2010 HCC)') +
ggtitle('Normality of Outcome')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Figure 2
comp_red <- lm(hcc_diff ~ as.factor(expansion_status),
data = county_dat_combo)
summary(comp_red)
##
## Call:
## lm(formula = hcc_diff ~ as.factor(expansion_status), data = county_dat_combo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26568 -0.03225 -0.00225 0.03432 0.30467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.035681 0.003663 9.741 < 2e-16 ***
## as.factor(expansion_status)2 -0.018703 0.004082 -4.582 4.80e-06 ***
## as.factor(expansion_status)3 -0.020354 0.004539 -4.485 7.56e-06 ***
## as.factor(expansion_status)4 -0.013431 0.003895 -3.448 0.000572 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05346 on 3114 degrees of freedom
## Multiple R-squared: 0.008397, Adjusted R-squared: 0.007442
## F-statistic: 8.79 on 3 and 3114 DF, p-value: 8.396e-06
comp1 <- lm(hcc_diff ~ as.factor(expansion_status) +
delta_medicaid +
delta_age +
delta_male ,
data = county_dat_combo)
summary(comp1)
##
## Call:
## lm(formula = hcc_diff ~ as.factor(expansion_status) + delta_medicaid +
## delta_age + delta_male, data = county_dat_combo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.260765 -0.031124 -0.002876 0.030114 0.247300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.022123 0.003577 6.184 7.05e-10 ***
## as.factor(expansion_status)2 -0.006265 0.003879 -1.615 0.106386
## as.factor(expansion_status)3 -0.006567 0.004263 -1.540 0.123555
## as.factor(expansion_status)4 0.013582 0.003938 3.449 0.000571 ***
## delta_medicaid 0.132407 0.005777 22.921 < 2e-16 ***
## delta_age 1.044439 0.078688 13.273 < 2e-16 ***
## delta_male 0.079834 0.030676 2.602 0.009299 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04932 on 3111 degrees of freedom
## Multiple R-squared: 0.157, Adjusted R-squared: 0.1553
## F-statistic: 96.54 on 6 and 3111 DF, p-value: < 2.2e-16
plot(comp1)
Figure 3
Figure 3
Figure 3
Figure 3
The additional quadratic delta_medicaid term improves the linearity of the final model, and is statistically significant.
comp2 <- lm(hcc_diff ~ as.factor(expansion_status) +
delta_medicaid +
delta_age +
delta_male +
I(delta_medicaid**2) ,
data = county_dat_combo)
summary(comp2)
##
## Call:
## lm(formula = hcc_diff ~ as.factor(expansion_status) + delta_medicaid +
## delta_age + delta_male + I(delta_medicaid^2), data = county_dat_combo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.247636 -0.031706 -0.002708 0.030112 0.245387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.027361 0.003793 7.214 6.80e-13 ***
## as.factor(expansion_status)2 -0.009816 0.003966 -2.475 0.01338 *
## as.factor(expansion_status)3 -0.010394 0.004355 -2.387 0.01706 *
## as.factor(expansion_status)4 0.010692 0.003992 2.678 0.00744 **
## delta_medicaid 0.136209 0.005837 23.334 < 2e-16 ***
## delta_age 1.036104 0.078518 13.196 < 2e-16 ***
## delta_male 0.086603 0.030644 2.826 0.00474 **
## I(delta_medicaid^2) -0.054369 0.013345 -4.074 4.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04919 on 3110 degrees of freedom
## Multiple R-squared: 0.1614, Adjusted R-squared: 0.1596
## F-statistic: 85.54 on 7 and 3110 DF, p-value: < 2.2e-16
plot(comp2)
Figure 4
Figure 4
Figure 4
Figure 4
The full model with the additional delta age, percent male, and percent eligible for medicaid is needed to explain the variation in the data.
anova(comp_red, comp2)
## Analysis of Variance Table
##
## Model 1: hcc_diff ~ as.factor(expansion_status)
## Model 2: hcc_diff ~ as.factor(expansion_status) + delta_medicaid + delta_age +
## delta_male + I(delta_medicaid^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 3114 8.8997
## 2 3110 7.5261 4 1.3736 141.9 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
AIC(comp2)
## [1] -9924.365
Because the residual plot depicts a violation of homoscedasticity and the qq plot illustrates deviation from normality, robust sandwich estimation methods were used to calculate the standard errors for the coefficients in the final adjusted model. This method is more robust than the standard LSE method for estimating the standard errors, and can be used to derive more accurate confidence intervals.
library(lmtest)
library(sandwich)
coeftest(comp2, vcov = vcovHC(comp2, type="HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0273608 0.0038189 7.1646 9.698e-13 ***
## as.factor(expansion_status)2 -0.0098163 0.0039447 -2.4884 0.0128823 *
## as.factor(expansion_status)3 -0.0103938 0.0043592 -2.3843 0.0171705 *
## as.factor(expansion_status)4 0.0106918 0.0039253 2.7238 0.0064891 **
## delta_medicaid 0.1362087 0.0064151 21.2326 < 2.2e-16 ***
## delta_age 1.0361040 0.0989823 10.4676 < 2.2e-16 ***
## delta_male 0.0866028 0.0402591 2.1511 0.0315424 *
## I(delta_medicaid^2) -0.0543689 0.0149204 -3.6439 0.0002729 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Confidence Intervals
(-0.0098163 + c(-1,1)*1.96*0.0039447 ) #95% ci for 2 vs 1
## [1] -0.017547912 -0.002084688
( -0.0103938 + c(-1,1)*1.96* 0.0043592 ) #95% ci for 3 vs 1
## [1] -0.018937832 -0.001849768
(0.0106918 + c(-1,1)*1.96*0.0039253) #95% ci for 4 vs 1
## [1] 0.002998212 0.018385388
\[ \begin{split} & E(\text{average HCC score 2018 - average HCC score 2010}) = 0.027 \\ & -0.0098*\text{I(State Expanded Medicaid in 2014)} + \\ & -0.01*\text{I(State Expanded Medicaid after 2014)} + \\ & 0.011*\text{I(State has not Expanded Medicaid)} + \\ & 1.036*\text{(delta age)} + \\ & 0.0866*\text{(delta male)} + \\ & 0.136*\text{(delta percent eligible for medicaid)} \\ & -0.0543*\text{(delta percent eligible for medicaid)}^2 \\ \end{split} \]
county_dat_combo %>% ggplot() +
geom_violin(aes(expansion_status, average_hcc_score_2010,
fill = expansion_status), alpha = 0.5,
draw_quantiles = c(0.5)) +
# scale_y_reverse() +
ylab('average HCC score') +
xlab('expansion status') +
annotate("text", label = "better health", x = 1, y = 0.6) +
annotate("text", label = "worse health", x = 1, y = 1.5) +
ggtitle('2010 Average HCC Scores per US County')
Create binary outcome categorizing hospital readmission rates & emergency department visits as high if the county rate exceeds the national average rate (hospital readmission - 18.06%, ED visits - 670 per 1000 beneficiaries)
Also, create multinomial class var with following labels:
* 4 - hospital readmission rate & % beneficiaries with an ED visit above national average
* 3 - beneficiaries w/ ED visit above national average (hospital readmission below)
* 2 - hospital readmission rate above national average (ED below)
* 1 - both hospital readmission & ED below
county_dat['high_hospital_readmission'] = ifelse(county_dat$hospital_readmission_rate >= .1806, 1, 0)
county_dat['high_ED'] = ifelse(county_dat$emergency_department_visits_per_1000_beneficiaries >= 670, 1, 0)
county_dat['hospital_ED_class'] = case_when(
(county_dat$high_hospital_readmission == 1) &
(county_dat$high_ED == 1)~ 3,
county_dat$high_ED == 1 ~ 2,
county_dat$high_hospital_readmission == 1 ~ 2,
TRUE ~ 1)
Expansion status of state is only predictive in the logistic model looking at high ED rates in counties; it is not a statistically significant predictor in the model with high hospitalization rates as the outcome (neither in binary - expanded vs non-expanded nor more verbose 4/3/2/1 classification)
mod.binary.ed = glm(high_ED ~
as.factor(expansion_status) +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
family = binomial(), data = county_dat)
summary(mod.binary.ed)
##
## Call:
## glm(formula = high_ED ~ as.factor(expansion_status) + percent_male +
## average_age + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## family = binomial(), data = county_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8545 -0.7890 0.3651 0.7858 2.6277
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.23201 3.53935 8.259 < 2e-16 ***
## as.factor(expansion_status)2 0.48100 0.18347 2.622 0.008749 **
## as.factor(expansion_status)3 0.41179 0.20568 2.002 0.045273 *
## as.factor(expansion_status)4 0.62052 0.17606 3.524 0.000424 ***
## percent_male -24.36284 2.36449 -10.304 < 2e-16 ***
## average_age -0.32340 0.04105 -7.878 3.33e-15 ***
## percent_eligible_for_medicaid 34.78637 2.54669 13.659 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -44.50784 4.58289 -9.712 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4188.5 on 3045 degrees of freedom
## Residual deviance: 3034.8 on 3038 degrees of freedom
## AIC: 3050.8
##
## Number of Fisher Scoring iterations: 4
mod.binary.ed.cont = glm(high_ED ~
as.integer(expansion_status) +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
family = binomial(), data = county_dat)
summary(mod.binary.ed.cont)
##
## Call:
## glm(formula = high_ED ~ as.integer(expansion_status) + percent_male +
## average_age + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## family = binomial(), data = county_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8605 -0.7909 0.3659 0.7819 2.6677
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.40714 3.53049 8.329 < 2e-16 ***
## as.integer(expansion_status) 0.12982 0.04346 2.987 0.00281 **
## percent_male -24.56220 2.35795 -10.417 < 2e-16 ***
## average_age -0.32284 0.04100 -7.874 3.44e-15 ***
## percent_eligible_for_medicaid 34.66931 2.53525 13.675 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -44.31853 4.55111 -9.738 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4188.5 on 3045 degrees of freedom
## Residual deviance: 3039.2 on 3040 degrees of freedom
## AIC: 3051.2
##
## Number of Fisher Scoring iterations: 4
Reduced model treating expansion_status as a continous covariate is sufficient; the categorical information is not needed.
anova(mod.binary.ed.cont, mod.binary.ed, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: high_ED ~ as.integer(expansion_status) + percent_male + average_age +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2)
## Model 2: high_ED ~ as.factor(expansion_status) + percent_male + average_age +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 3040 3039.2
## 2 3038 3034.8 2 4.4276 0.1093
exp(0.48100 + c(-1,1)*0.18347*1.96) # expansion status 2 vs 1
## [1] 1.129075 2.317760
exp(0.41179 + c(-1,1)*0.20568*1.96) # expansion status 3 vs 1
## [1] 1.008695 2.259001
exp(0.62052 + c(-1,1)*0.17606*1.96) # expansion status 4 vs 1
## [1] 1.317113 2.626357
# continuous
exp(0.12982 + c(-1,1)*0.04346*1.96)
## [1] 1.045650 1.239864
# OR - category 4 vs 1
odds4 = exp(29.40714 + 4*0.12982)
odds1 = exp(29.40714 + 0.12982)
OR4v1 = odds4/odds1 # OR
library(ResourceSelection)
## ResourceSelection 0.3-5 2019-07-22
hoslem.test(county_dat$high_ED, predict(mod.binary.ed), g = 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: county_dat$high_ED, predict(mod.binary.ed)
## X-squared = -1948.5, df = 8, p-value = 1
hoslem.test(county_dat$high_ED, predict(mod.binary.ed.cont), g= 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: county_dat$high_ED, predict(mod.binary.ed.cont)
## X-squared = -1981, df = 8, p-value = 1
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
roc_cont <- roc(county_dat$high_ED, predict(mod.binary.ed.cont))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_cat <- roc(county_dat$high_ED, predict(mod.binary.ed))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_cont, col = 1,lty = 4)
plot(roc_cat, add =TRUE, col = 4, lty=2, alpha =.5)
legend(0.5,0.5,
legend=c("continuous", "categorical"),
col=c(1,4), lty=c(4,2))
auc(roc_cont)
## Area under the curve: 0.8336
auc(roc_cat)
## Area under the curve: 0.8342
\[ \begin{split} & logit(\text{p(high ED usage)}) = 29.4 \\ & 0.13*\text{(expansion status)} \\ & -24.56*\text{(percent male)} \\ & -0.322*\text{(average age)} + \\ & 34.67*\text{(percent eligible for medicaid)} \\ & -44.32*\text{(percent eligible for medicaid)}^2 \\ \end{split} \]
temp <- county_dat %>% group_by(expansion_status, high_ED) %>% count(high_ED)
temp %>% ggplot() +
geom_bar(aes(x=expansion_status,
y=n, fill = as.factor(high_ED)),
position = "fill", stat='identity' ) +
ylab('% high ED usage') + xlab('Expansion Status Category') +
ggtitle('Emergency Department Usage Levels') +
scale_fill_manual(name='Usage Level',
values = c('0' = "skyblue",
'1' = "lightcoral"),
labels = c('low', 'high'))
mod.binary.hosp = glm(high_hospital_readmission ~
as.factor(expansion_status) +
percent_male +
average_age +
percent_eligible_for_medicaid,
family = binomial(), data = county_dat)
summary(mod.binary.hosp)
##
## Call:
## glm(formula = high_hospital_readmission ~ as.factor(expansion_status) +
## percent_male + average_age + percent_eligible_for_medicaid,
## family = binomial(), data = county_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2497 -0.8481 -0.6165 1.1064 2.7787
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 26.57178 3.09240 8.593 < 2e-16 ***
## as.factor(expansion_status)2 0.27834 0.17757 1.567 0.117
## as.factor(expansion_status)3 -0.27100 0.20207 -1.341 0.180
## as.factor(expansion_status)4 0.19658 0.17157 1.146 0.252
## percent_male -17.57146 2.18202 -8.053 8.09e-16 ***
## average_age -0.28662 0.03611 -7.936 2.08e-15 ***
## percent_eligible_for_medicaid 5.36320 0.65799 8.151 3.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3866.0 on 3045 degrees of freedom
## Residual deviance: 3443.4 on 3039 degrees of freedom
## AIC: 3457.4
##
## Number of Fisher Scoring iterations: 4
library(nnet)
mod.multi <- multinom(as.factor(hospital_ED_class) ~
as.factor(expansion_status) +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2) +
I(average_age**2),
data = county_dat)
## # weights: 30 (18 variable)
## initial value 3346.373031
## iter 10 value 2877.441498
## iter 20 value 2687.437464
## iter 30 value 2663.789938
## iter 40 value 2663.730781
## iter 50 value 2663.721979
## iter 60 value 2663.606342
## iter 70 value 2663.091050
## final value 2663.058111
## converged
summary(mod.multi)
## Call:
## multinom(formula = as.factor(hospital_ED_class) ~ as.factor(expansion_status) +
## percent_male + average_age + percent_eligible_for_medicaid +
## I(percent_eligible_for_medicaid^2) + I(average_age^2), data = county_dat)
##
## Coefficients:
## (Intercept) as.factor(expansion_status)2 as.factor(expansion_status)3
## 2 -34.10016 0.4346568 0.37510679
## 3 53.44012 0.6711761 0.08902687
## as.factor(expansion_status)4 percent_male average_age
## 2 0.5546608 -20.56062 1.3464230
## 3 0.7291613 -36.45654 -0.7249963
## percent_eligible_for_medicaid I(percent_eligible_for_medicaid^2)
## 2 27.76756 -33.81324
## 3 38.75018 -45.32945
## I(average_age^2)
## 2 -0.011099119
## 3 0.001671282
##
## Std. Errors:
## (Intercept) as.factor(expansion_status)2 as.factor(expansion_status)3
## 2 0.0009394821 0.02741390 0.004877179
## 3 0.0011593232 0.02062059 0.002907146
## as.factor(expansion_status)4 percent_male average_age
## 2 0.03332274 0.0006059360 0.03389721
## 3 0.02444438 0.0007932748 0.04098273
## percent_eligible_for_medicaid I(percent_eligible_for_medicaid^2)
## 2 0.001034685 0.0003822978
## 3 0.001440342 0.0006571129
## I(average_age^2)
## 2 0.0004725000
## 3 0.0005751155
##
## Residual Deviance: 5326.116
## AIC: 5362.116
The model is not doing a good job classifying health outcome class based on expansion status, largely because the raw data does not have clear distinctions across the outcome categories.
county_dat['predict'] <- predict(mod.multi)
county_dat %>% ggplot() + geom_histogram(aes(predict, fill = expansion_status), stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
county_dat %>% ggplot() + geom_histogram(aes(hospital_ED_class, fill = expansion_status), stat="count")
## Warning: Ignoring unknown parameters: binwidth, bins, pad
library(VGAM)
## Loading required package: stats4
##
## Attaching package: 'VGAM'
## The following object is masked from 'package:lmtest':
##
## lrtest
## The following object is masked from 'package:gam':
##
## s
## The following object is masked from 'package:tidyr':
##
## fill
ord.all <- vglm(hospital_ED_class ~
expansion_status +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
cumulative(parallel=TRUE, reverse=TRUE),
data=county_dat)
summary(ord.all)
##
## Call:
## vglm(formula = hospital_ED_class ~ expansion_status + percent_male +
## average_age + percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## family = cumulative(parallel = TRUE, reverse = TRUE), data = county_dat)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## logitlink(P[Y>=2]) -6.645 -0.6259 0.2409 0.6042 9.577
## logitlink(P[Y>=3]) -2.781 -0.5718 -0.2235 0.4093 8.312
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept):1 29.13773 2.86319 10.177 < 2e-16 ***
## (Intercept):2 26.88181 2.85509 9.415 < 2e-16 ***
## expansion_status2 0.44145 0.15409 2.865 0.00417 **
## expansion_status3 0.06553 0.17235 0.380 0.70381
## expansion_status4 0.44713 0.14834 3.014 0.00258 **
## percent_male -21.75771 1.90653 -11.412 < 2e-16 ***
## average_age -0.31877 0.03295 -9.673 < 2e-16 ***
## percent_eligible_for_medicaid 28.58780 1.98938 14.370 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -36.10688 3.44578 -10.479 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Names of linear predictors: logitlink(P[Y>=2]), logitlink(P[Y>=3])
##
## Residual deviance: 5365.752 on 6083 degrees of freedom
##
## Log-likelihood: -2682.876 on 6083 degrees of freedom
##
## Number of Fisher scoring iterations: 6
##
## No Hauck-Donner effect found in any of the estimates
##
##
## Exponentiated coefficients:
## expansion_status2 expansion_status3
## 1.554954e+00 1.067720e+00
## expansion_status4 percent_male
## 1.563822e+00 3.554223e-10
## average_age percent_eligible_for_medicaid
## 7.270416e-01 2.603295e+12
## I(percent_eligible_for_medicaid^2)
## 2.084397e-16
Test proportional odds assumption - no overlap in beta estimates confidence interval. Proportional odds assumption is violated.
county_dat['ind3'] = ifelse(county_dat$hospital_ED_class == 3, 1, 0)
county_dat['ind32'] = ifelse(county_dat$hospital_ED_class == 1, 0, 1)
mod.3v12 <- glm(ind3 ~expansion_status +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
family=binomial,
data=county_dat)
mod.32v1 <- glm(ind32 ~expansion_status +
percent_male +
average_age +
percent_eligible_for_medicaid +
I(percent_eligible_for_medicaid**2),
family=binomial,
data=county_dat)
summary(mod.3v12)
##
## Call:
## glm(formula = ind3 ~ expansion_status + percent_male + average_age +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## family = binomial, data = county_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0996 -0.7184 -0.4320 0.5547 2.8688
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 29.52740 3.75773 7.858 3.91e-15 ***
## expansion_status2 0.37280 0.20374 1.830 0.0673 .
## expansion_status3 -0.17955 0.23138 -0.776 0.4377
## expansion_status4 0.33197 0.19819 1.675 0.0939 .
## percent_male -21.77906 2.54873 -8.545 < 2e-16 ***
## average_age -0.34106 0.04262 -8.003 1.22e-15 ***
## percent_eligible_for_medicaid 22.20724 2.71752 8.172 3.04e-16 ***
## I(percent_eligible_for_medicaid^2) -26.63211 4.56332 -5.836 5.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3431.2 on 3045 degrees of freedom
## Residual deviance: 2788.8 on 3038 degrees of freedom
## AIC: 2804.8
##
## Number of Fisher Scoring iterations: 5
summary(mod.32v1)
##
## Call:
## glm(formula = ind32 ~ expansion_status + percent_male + average_age +
## percent_eligible_for_medicaid + I(percent_eligible_for_medicaid^2),
## family = binomial, data = county_dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9339 -0.7994 0.3925 0.7404 3.1234
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 30.18076 3.58293 8.423 < 2e-16 ***
## expansion_status2 0.51572 0.18499 2.788 0.005306 **
## expansion_status3 0.31301 0.20785 1.506 0.132069
## expansion_status4 0.61732 0.17673 3.493 0.000478 ***
## percent_male -24.93738 2.37775 -10.488 < 2e-16 ***
## average_age -0.31929 0.04173 -7.652 1.99e-14 ***
## percent_eligible_for_medicaid 30.46235 2.36156 12.899 < 2e-16 ***
## I(percent_eligible_for_medicaid^2) -36.09792 4.25296 -8.488 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4005.7 on 3045 degrees of freedom
## Residual deviance: 2950.9 on 3038 degrees of freedom
## AIC: 2966.9
##
## Number of Fisher Scoring iterations: 5
# ord.all_potest <- vglm(hospital_ED_class ~
# as.factor(expansion_status) +
# percent_male +
# average_age +
# percent_eligible_for_medicaid +
# I(percent_eligible_for_medicaid**2),
# cumulative(parallel=FALSE, reverse=T),
# data=county_dat)
#
# summary(ord.all_potest)